Prepare mktables for Unicode 15.1 and 16.0 #23133

khwilliamson · 2025-03-18T21:05:12Z

perldelta not needed until the actual releases are incorporated.

This set of changes does not require a perldelta entry.

Leont · 2025-03-19T00:34:20Z

lib/unicore/mktables

    if (defined (my $bmg = property_ref('Bidi_Mirroring_Glyph'))) {
        $bmg->set_to_output_map($EXTERNAL_MAP);
        $bmg->set_range_size_1(1);
    }

    property_ref('Numeric_Value')->set_to_output_map($OUTPUT_ADJUSTED);

+    # These two properties have no short names and the file names for them
+    # clash in DOS 8.3.  Work around this by creating shorter file names that


Where are we still limited by 8.3?

On IRC the other day, I asked if we were still limited, and the answer was yes.

For Unicode filenames yes, but for ASCII filenames we aren't AFAIK.

I'd prefer to leave this as-is, since it is trivial to do, just in case. And I have WIP which should get rid of them altogether.

jkeenan

The commit message for aa6faba has 2 misspellings: "infrastructue" lacks the second r, and in "incoroporated" the second o needs removal.

jkeenan · 2025-04-01T19:06:06Z

This p.r. for Unicode mktables did not make it into the March 20 dev release. Does that mean we have to defer it to the 5.43 dev cycle?

Leont · 2025-04-01T22:28:49Z

> This p.r. for Unicode mktables did not make it into the March 20 dev release. Does that mean we have to defer it to the 5.43 dev cycle?

The change isn't really user visible, it would only affect people who would want to patch in a more recent Unicode version.

jkeenan

@khwilliamson there's one unresolved conversation in this p.r. If you mark that resolved, then I think this is okay to merge.

khwilliamson · 2025-04-03T00:27:13Z

There are more commits coming

We handle it by ignoring this file, new to Unicode 16.0. It consists of lists of characters that, to put it less delicately than Unicode would like, they regret creating. But there are no rules associated with them. It would be nice to have a \p{DoNotEmit} property so that applications could handle situations where this occurs. But I'm fearful that if we did something like this, Unicode would later come up with something that had the same intention but was subtly or unsubtly different. That has happened before, to our detriment. So I think we should wait to see what they do in future releases.

These new hieroglyphic properties are handled automatically, but there is a problem: they have no short-form names. Files are written for them based on their names, and those file names are not distinguishable on a DOS 8.3 file system. The solution here is to manually override the automatically generated file names with distinguishable ones.
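To illustrate the kind of clash described above, here is a minimal sketch. It is not what mktables does: the helper names `base_8_3` and `find_clashes`, the simple "first 8 characters" truncation rule, and the property names are all hypothetical, chosen only to show how two long names can collapse to the same 8.3 base name.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce a name to at most 8 characters, roughly the way
# a DOS 8.3 file system limits the base part of a file name.
sub base_8_3 {
    my ($name) = @_;
    return lc substr($name, 0, 8);
}

# Group names by their truncated form. Any group with more than one member is
# a clash that would need a manual override.
sub find_clashes {
    my (@names) = @_;
    my %by_base;
    push @{ $by_base{ base_8_3($_) } }, $_ for @names;

    my %clash;
    for my $base (keys %by_base) {
        $clash{$base} = $by_base{$base} if @{ $by_base{$base} } > 1;
    }
    return \%clash;
}

# Made-up property names that share their first 8 characters.
my $clashes = find_clashes(qw(Long_Name_One Long_Name_Two Short));
for my $base (sort keys %$clashes) {
    print "$base: @{ $clashes->{$base} }\n";
}
```

A check along these lines only detects the problem; picking the replacement names is still a manual decision.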

mktables does a lot of sanity checks on the data it gets fed. One of those is to make sure any \d group of code points is 10 long. This verifies that Unicode has given us enough code points to form 0-9. It assumes that if it got this much right, their numeric values are also 0-9. This check has uncovered issues with the Unicode Standard in the past. Nowadays, they've cleaned up their act, and it's been many releases since there have been problems. But our checks remain, and I think they should. What happened in Unicode 16.0 was that there was a range of \d characters that contains two consecutive groups of 0-9 values. The check could be changed to verify that the count is divisible by 10, but checking for this particular range is a bit safer.
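The trade-off described above can be sketched roughly as follows. This is an illustration only, not the mktables code: `check_digit_ranges` and `%double_run_start` are hypothetical names, and the code points in Plane 15 are private-use placeholders, not the actual Unicode 16.0 range.

```perl
use strict;
use warnings;

# Hypothetical allow-list of range starts known to hold two back-to-back 0-9
# runs. The value is a placeholder, not the real Unicode 16.0 range.
my %double_run_start = (0xF0000 => 1);

# Return a list of complaints about digit ranges that don't look like 0-9.
sub check_digit_ranges {
    my (@ranges) = @_;    # each element is [ $start, $end ]
    my @problems;
    for my $range (@ranges) {
        my ($start, $end) = @$range;
        my $count = $end - $start + 1;
        next if $count == 10;                               # the normal case
        next if $count == 20 && $double_run_start{$start};  # known exception
        push @problems, sprintf("U+%04X..U+%04X has %d code points",
                                $start, $end, $count);
    }
    return @problems;
}

my @problems = check_digit_ranges(
    [ 0x0030,  0x0039  ],    # ASCII digits: exactly 10
    [ 0xF0000, 0xF0013 ],    # placeholder double run: 20, allow-listed
    [ 0xF0100, 0xF0113 ],    # 20 long but not allow-listed: reported
);
print "$_\n" for @problems;  # prints: U+F0100..U+F0113 has 20 code points
```

A plain "divisible by 10" test would accept the third range silently, which is why special-casing the one known range is the more conservative choice.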

khwilliamson · 2025-04-08T13:40:36Z

This has been repushed, with the new hieroglyphic properties now working

changes made as requested

khwilliamson · 2025-04-17T03:50:57Z

I think the PSC @haarg @ap @book should consider whether we should ship Perl 5.42 without updating the Unicode version. We are now one major version and one dot release behind. The sole reason is that the break properties have been very difficult to update. The rest of each release updates smoothly. The break properties are what the regular expression constructs \b{gcb}, \X, \b{lb}, and \b{wb} match against. The sentence break property is unaffected. These properties have data files, yes, but the actual rules for them come from documentation.
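For readers less familiar with these constructs, a small sketch of what the grapheme cluster break property drives (assuming Perl 5.22 or later, where \b{gcb} is available). The sample string is made up for illustration.

```perl
use strict;
use warnings;

# "e" followed by U+0301 COMBINING ACUTE ACCENT, then "a":
# three code points, but only two grapheme clusters.
my $text = "e\x{301}a";

my @clusters = $text =~ /(\X)/g;          # \X matches one whole cluster
my @pieces   = split /\b{gcb}/, $text;    # same boundaries via \b{gcb}

printf "%d code points, %d clusters, %d pieces\n",
    length($text), scalar @clusters, scalar @pieces;
```

Where those boundaries fall is decided by the break rules of whatever Unicode version the running perl was built with, which is why updating them is the delicate part.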

Technically, it is past the deadline for such changes in this development cycle. A program that depends on a particular code point being unassigned could fail when that code point does get assigned to a specific character, and the new releases assign thousands of new characters. On the other hand, one could argue that such a program is incorrect, as it depends on the stability of something that is inherently unstable and documented as such. (There are some code points that are never going to become characters, so you could use some of those. And there are ones that Unicode would have to be pretty desperate to assign, such as the one in the position where a capital Greek Final Sigma would be. There is no such character, but they left a hole where it would have appeared so as not to mess up the symmetry of the rest of the Greek encoding.)
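As a rough illustration of the dependency being discussed, the sketch below asks the running perl whether a code point is unassigned. The helper name `is_unassigned` is hypothetical; U+03A2 is the hole just before GREEK CAPITAL LETTER SIGMA (U+03A3). The answer for any given code point depends on the Unicode version perl was built with.

```perl
use strict;
use warnings;
use Unicode::UCD ();

# Report whether the running perl's Unicode tables treat a code point as
# unassigned (General_Category=Cn).
sub is_unassigned {
    my ($cp) = @_;
    return chr($cp) =~ /\p{Unassigned}/ ? 1 : 0;
}

print "Unicode version: ", Unicode::UCD::UnicodeVersion(), "\n";
printf "U+%04X unassigned: %d\n", $_, is_unassigned($_) for 0x03A2, 0x03A3;
```

A program that hard-codes an expectation like this for an ordinary unassigned code point is the kind that could break when a new Unicode version is incorporated.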

Unicode has changed the line breaking algorithm for some Indic characters. If you relied on the old algorithm, your code would break; on the other hand, people would be mad at you for not giving them the results the language dictates. The break algorithms are declared to be unstable by Unicode.

I had hoped to get a PR ready by today, but I ran out of time, though I'm close.

I do think we should make some effort to keep up with Unicode releases.

Leont reviewed Mar 19, 2025

khwilliamson force-pushed the mktables_15.1 branch from 4894f2a to 1f07a91 Compare March 19, 2025 14:49

jkeenan previously requested changes Mar 19, 2025

khwilliamson force-pushed the mktables_15.1 branch from 1f07a91 to de01c61 Compare March 20, 2025 00:11

jkeenan reviewed Apr 2, 2025

khwilliamson marked this pull request as draft April 3, 2025 00:27

khwilliamson added 4 commits April 7, 2025 16:32

mktables: White-space, comment only

3569950

Add comments, and rewrap comment lines to fit 80 columns

mktables: Handle new property NFKC_Simple_Casefold

7f96e12

Unicode 15.1 introduces this new property, which needs the same special handling as plain NFKC_Casefold does.

mktables: Ignore missings entries in two files

20c6a05

These files are changed in 15.1 to have @missings lines, whereas they didn't before. This leads to some warning messages, so turn off looking at them, as we do for a number of other files.

khwilliamson force-pushed the mktables_15.1 branch from de01c61 to 8f58648 Compare April 7, 2025 23:39

khwilliamson marked this pull request as ready for review April 7, 2025 23:41

khwilliamson force-pushed the mktables_15.1 branch from 8f58648 to 5b52ed1 Compare April 8, 2025 00:49

khwilliamson added 4 commits April 8, 2025 04:25

mktables: Handle Unicode 16.0 Unikemet.txt file

cef0459

This includes several new properties, some of which are considered "provisional" by Unicode, which means they can be heavily revised or withdrawn. These properties are designed for use by scholars of hieroglyphics.

mktables: Add count() method to Range class

32ee519

There is already this method for lists of Ranges, so this is just so callers don't need to know which they are operating on.
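The idea can be sketched as below. This is a simplified stand-in, not the mktables classes: `My::Range` and `My::Range_List` are hypothetical names, and plain hash-based objects are used for brevity. It needs Perl 5.14 or later for the `package NAME { ... }` syntax.

```perl
use strict;
use warnings;

package My::Range {          # hypothetical stand-in for a single range
    sub new {
        my ($class, $start, $end) = @_;
        return bless { start => $start, end => $end }, $class;
    }
    # Number of code points in this one range.
    sub count {
        my ($self) = @_;
        return $self->{end} - $self->{start} + 1;
    }
}

package My::Range_List {     # hypothetical stand-in for a list of ranges
    sub new {
        my ($class, @ranges) = @_;
        return bless { ranges => [@ranges] }, $class;
    }
    # Total number of code points across all contained ranges.
    sub count {
        my ($self) = @_;
        my $total = 0;
        $total += $_->count for @{ $self->{ranges} };
        return $total;
    }
}

# Callers can now call count() without caring which kind of object they hold.
my $digits = My::Range->new(0x30, 0x39);
my $list   = My::Range_List->new($digits, My::Range->new(0x41, 0x5A));
print $_->count, "\n" for $digits, $list;    # prints 10, then 36
```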

khwilliamson force-pushed the mktables_15.1 branch from 5b52ed1 to 32ee519 Compare April 8, 2025 10:35

khwilliamson merged commit 4071919 into Perl:blead Apr 18, 2025
33 checks passed

khwilliamson deleted the mktables_15.1 branch April 23, 2025 10:39
